Modern software architectures based on microservices and serverless architectures bring advantages for application development. Distributed development teams can manage, monitor and operate their individual services more easily. The disadvantage is that they can easily lose sight of the “big picture”. If there are problems in a transaction that is distributed across several microservices, serverless functions and teams, it is almost impossible to distinguish the service responsible for the problem from the affected services. Distributed tracing should provide support here and monitor and make visible the overall system behavior.
Distributed tracing records the paths that a request (from an application or an end user) takes through a distributed application landscape such as microservices or serverless functions. Distributed tracing therefore offers an end-to-end view of the request and the relationships between different services. It is therefore a diagnostic technique that shows how a set of services behaves in order to process individual requests.
The terminology behind distributed tracing
Before we can talk about how distributed tracing works, we need to define some basic terms. To do this, it’s useful to refer to the definitions from OpenTelemetry. OpenTelemetry provides an open source standard and a set of technologies that can be used to capture and export traces from their cloud-native applications and infrastructure.
- Request: A request (also known as a transaction) is the way in which applications communicate in a distributed system landscape. Each service can use a different technology to process the request, e.g. HTTP or MQTT.
- Trace: A trace represents the end-to-end flow of a request through the various services. Each trace consists of several spans.
- Span: In distributed tracing, a span indicates a single unit of work that is executed in a trace, e.g. an API call or a database query. Each service in the flow of a particular request through the distributed system contributes a span with a temporal context. A distinction can be made between two different types of span:
- Root span: The root span (or parent span) is the first span in a trace.
- Child span: All subsequent spans are referred to as child spans.
- Instrumentation: In microservices environments, instrumentation usually refers to the code that is added to a service so that data can be collected. Modern tracing tools usually support instrumentation in multiple languages and frameworks. For Java, for example, instrumentation can be implemented with Spring Cloud Sleuth. The standard configurations (automated setup of spans, traces, etc.) are already adopted by the framework without you having to change anything in the code yourself. This allows various distributed tracing tools such as Jaeger or Zipkin to collect and visualize this data.
- Sampling: Tracing data is often produced in large quantities, it is not only “expensive” to collect and store, but it is also “expensive” to transmit. Sampling is used to strike a balance between monitoring and these costs. It is the process of deciding whether or not to process and export a chip. There are two approaches here:
- Head-based sampling: The sampling decision is made at the very beginning, when the trace starts.
- Tail-based sampling: The sampling decision is only made for the respective trace at the end of the process.
- Trace context: The trace context is used to track the request across services. This is comparable to the shipping label on a parcel. The trace context contains data such as the trace ID, the span ID or various flags such as the sampling flag, which indicates whether the span should be processed or not. The trace context is therefore appended to the metadata of the transport protocol for each request. As can be seen in Table 1, the various providers have different formats for the trace context. Here, however, you should rely on the W3C TraceContext in your implementation. It attempts to establish a standardization in the jungle of different trace contexts, which more and more tool providers are also adapting.
- Exporter: The component that bundles the data of a trace and exports it to the target backend or an endpoint in the required format. This allows distributed tracing tools such as Jaeger or Zipkin to collect, process and visualize the data.
How it all works
Distributed tracing data collection starts from the moment a request reaches a service, e.g. when a user submits a form on the website. The respective instrumentation then triggers the creation of a unique trace ID and a span, in this case the parent span. This first span ID is then the same as the trace ID. With each subsequent service call, the trace ID remains the same and the span ID changes. Each additional span is a child span. This procedure makes it possible to create a tree graph from the data that shows how much time the request has spent in each service, as can be seen in Figure 1 for example.
| Trace context solution | Trace context example |
|—|—|
| Jaeger | uber-trace-id:{trace-id}:{span-id}:{parent-span-id}:{flags} |
| Zipkin | X-B3-TraceId:{trace-id}
X-B3-SpanId:{span-id}
X-B3-ParentSpanId:{parent-span-id}
X-B3-Sampled:{sampleFlag} |
| W3C-TraceContext | traceparent:{version}-{trace-id}-{parent-id}-{trace-flags} |
Table 1: Trace context examples from various providers
Fig. 1: Example of a tree graph
The respective exporter then sends the data to the tracing server. In addition to the span context, further data can be sent along with the trace. This includes tags or logs, for example, which can provide the trace with important additional information for analysis, as shown in Figure 2.
Fig. 2: Example of trace data
Finally, the spans can be visualized in a tracing tool, for example using a flame graph. The parent span is at the top, with all child spans appearing below it in chronological order. An example of this is shown in Figure 3, which shows which service handled the request when and for how long. In our example, it is noticeable that service D “cost” the most time. It might be worth taking a closer look here in order to derive optimizations.
The benefits of distributed tracing
If distributed tracing is used, it should of course bring benefits. The following are the most important advantages of using distributed tracing:
- Minimization of MTTD (mean time to detect) and MTTR (mean time to recover): Distributed tracing helps to reduce the time it takes to identify a problem and the time it takes to fix it. Time is money, so the faster the better.
- Understanding service relationships: Distributed tracing can create an understanding of which services are related to each other. This provides a picture of the cause-and-effect relationship between the services.
- Measuring specific user actions: For example, if a user submits a particular form, this action can be tracked across all services. Bottlenecks, errors etc. can thus be assigned to specific processes.
- Improved collaboration and productivity: A request can run through several services for whose development several teams are responsible. Distributed tracing can efficiently reveal where the error occurred and which team needs to fix the problem.
- Maintenance of SLAs (Service Level Agreements): Most companies have SLAs, which are contracts with the customer for specific service performance targets. Distributed Tracing can aggregate various performance data and helps to assess whether SLAs are being met.
Conclusion
In general, distributed tracing is the best way for DevOps, operations, software and SRE (site reliability engineering) teams to quickly get answers to specific questions in distributed software environments. Once a request involves a handful of microservices, it’s essential to see how all the different services work together. Trace data provides an answer to what is happening across the application. If there were only isolated views of the individual services, there would be no way to reconstruct the flow between individual services. Distributed tracing is a purely technical tool that should be used when you need answers to the following questions:
- Which services are included in the request and for how long?
- What is the cause of errors in a distributed system?
- Where are there performance bottlenecks?
Distributed tracing makes it possible to quickly identify problems. In addition, it helps not to react only when something has already happened. You should proactively look for problems in order to eliminate them before major difficulties arise. It is therefore quite clear that this is not just hype, but a helpful tool for gaining an overview of the “big picture” in the services jungle.
It should be noted that distributed tracing is only one part of the overall monitoring process. It definitely provides information about the services and their relationship to each other. However, it is essential to consider the other two pillars of observability in addition to distributed tracing: distributed metrics and distributed logging. This is the only way to create a business view in addition to the technical view. All three are necessary for an optimal understanding of a distributed application’s performance.